Voice Conversion — Chatterbox VC

Generated 2026-02-14 22:31

Chatterbox VC by Resemble AI is a zero-shot voice conversion model.

It encodes the source audio into discrete S3 speech tokens, which capture content and prosody, and extracts a speaker embedding from a short reference clip of the target voice. A flow-matching model then decodes a new waveform that sounds like the target speaker saying the source content.

No training or fine-tuning is needed, only a few seconds of reference audio.
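A conversion like the one on this page can be sketched in a few lines of Python. This is a minimal sketch, not the script that produced this page: the import path `chatterbox.vc`, `ChatterboxVC.from_pretrained(device)`, `generate(audio=..., target_voice_path=...)`, and the `model.sr` attribute are assumptions based on the project's published example and may differ between versions. The helper names `converted_name` and `convert` are hypothetical.

```python
from pathlib import Path


def converted_name(source: str, target: str) -> str:
    """Build '<source-stem>_as_<target-stem>.wav' next to the source file."""
    src, tgt = Path(source), Path(target)
    return str(src.with_name(f"{src.stem}_as_{tgt.stem}.wav"))


def convert(source: str, target: str, device: str = "cpu") -> str:
    """Convert `source` to the voice in `target` and return the output path."""
    # Imported lazily so the naming helper works without the heavy dependencies.
    import torchaudio
    from chatterbox.vc import ChatterboxVC  # assumed import path

    # Assumed API: loads the pretrained weights onto the given device.
    model = ChatterboxVC.from_pretrained(device)
    # Assumed API: tokenizes the source, embeds the reference speaker,
    # and returns the converted waveform as a tensor.
    wav = model.generate(audio=source, target_voice_path=target)

    out = converted_name(source, target)
    torchaudio.save(out, wav, model.sr)  # model.sr: assumed output sample rate
    return out
```

Calling `convert("juniper-long-en.wav", "chris-ref.mp3")` would write `juniper-long-en_as_chris-ref.wav`, provided both input files exist and the model weights can be loaded.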

Target voice: chris-ref.mp3

Source (original voice): juniper-long-en.wav
Converted (target voice): juniper-long-en_as_chris-ref.wav

23.8s · converted in 1.8s
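To put the timing in perspective, the real-time factor (processing time divided by audio duration) can be worked out as below. This assumes that 23.8 s is the duration of the clip and 1.8 s is the wall-clock conversion time, which the page does not state explicitly. The function name `real_time_factor` is hypothetical.

```python
def real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Return processing time per second of audio (below 1 is faster than real time)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Figures from this page, assumed to be clip duration and conversion time.
rtf = real_time_factor(23.8, 1.8)
speedup = 1 / rtf
print(f"RTF {rtf:.3f}, about {speedup:.1f}x faster than real time")
```

Under that assumption the conversion ran at roughly 13 times real-time speed. The figure depends on the hardware used, which the page does not specify.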